PyLadiesCon 2025 | December 5-7, 2025
Understand when to use stratified sampling over random sampling
Learn how to define subgroups (strata) and implement sampling in Python
Evaluate sample representativeness by checking distribution patterns
Apply weights to adjust for unequal proportions in score estimation
Today’s presentation
Demonstrate how stratified random sampling to identify subgroups - Post-stratification weights are applied to correct unequal representation
Challenges in surveying large populations: time, budget, and resource constraints
Importance of representative samples in research to reduce bias
Align research outcomes with actionable goals: informing policy, guiding interventions, or enhancing program effectiveness
Allow for subgroup analysis to understand diverse perspectives within the population
Figure 1. Types of Sampling Methods with Illustration (Generated with Copilot)
Stratified sampling is a research design method that involves dividing a population into smaller subgroups known as strata.
Strata are formed based on shared attributes or characteristics (e.g., age group, gender, geographic location).
A sampling technique, such as random, systematic, or convenience sampling, is then applied to select participants in each stratum.
📘 Examples
Systematic sampling within strata: You divided a college students population based on their major and then select every kth element based on the size needed from each stratum.
Stratified random sampling within strata: You divided a college students population based on their major and then randomly select a proportionate number of students from each major.
Figure 2. Stratified Random Sampling Workflow Diagram (Generated with Copilot)
In this example, the population is defined as 25,000 college students.
Variables in the population dataset:
| StudentID | Gender | Age | Race | Ethnicity | StudentType | OverallGPA | |
|---|---|---|---|---|---|---|---|
| 0 | SID000001 | Male | 18 | White | Hispanic or Latino | Freshman | 4.00 |
| 1 | SID000002 | Female | 40 | Two or more races | Hispanic or Latino | Continuing Education | 0.00 |
| 2 | SID000003 | Female | 19 | White | Not Hispanic or Latino | Junior | 3.23 |
The ideal sample size for an infinite population is given by: \[ n_0 = \frac{Z^2 \, p \, (1 - p)}{e^2} \]
where:
The sample size adjusted for a finite population is given by:
\[ n = \frac{n_0}{1 + \left( \frac{n_0 - 1}{N} \right)} \]
where:
# Cochran's Sample Size Calculation
import scipy.stats as stats
def cochran_sample_size(Z, p, e, N):
n0 = (Z**2 * p * (1 - p)) / (e**2)
n = n0 / (1 + ((n0 - 1) / N))
return int(n)
# Parameters
Z = stats.norm.ppf(0.975) # Calculate the critical z-value for a 95% confidence level in a standard normal distribution
p = 0.5 # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e = 0.05 # Margin of error (precision level, more precision requires larger sample size)
N = 25000 # Finite population size (in this example, student population size)
# Calculate sample size
# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e**2)
sample_size = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0): {round(n0)}")
print(f"Adjusted sample size for finite population: {int(sample_size)}")Ideal sample size (n0): 384
Adjusted sample size for finite population: 378
Example 1: By increments of 2 or 10
# Binding age groups by increments of 2 years for younger ages and 10 years for older ages
population_data['Age_Group_1'] = pd.cut(population_data['Age'],
bins=[17, 19, 21, 23, 25, 27, 29, 31, 41, 50],
labels=['17-18', '19-20', '21-22', '23-24', '25-26', '27-28', '29-30', '31-40', '41-50'], right=False)
# View the distribution after recoding for both counts and proportions
# Make a list of two outputs: counts and proportions
# To view counts
age_1_group_counts = population_data['Age_Group_1'].value_counts().sort_index()
# To view proportions
age_1_group_proportions = population_data['Age_Group_1'].value_counts(normalize=True).sort_index().round(3)
age_1_group_counts, age_1_group_proportions(Age_Group_1
17-18 10123
19-20 8896
21-22 4222
23-24 506
25-26 329
27-28 277
29-30 235
31-40 400
41-50 12
Name: count, dtype: int64,
Age_Group_1
17-18 0.405
19-20 0.356
21-22 0.169
23-24 0.020
25-26 0.013
27-28 0.011
29-30 0.009
31-40 0.016
41-50 0.000
Name: proportion, dtype: float64)
Example 2: By meaningful groups
# We will keep the first three groups like the previous example, but then group all older ages into one group, as 31 and above.
population_data['Age_Group_2'] = pd.cut(population_data['Age'],
bins=[17, 19, 21, 23, 25, 31, 44],
labels=['17-18', '19-20', '21-22', '23-25', '26-30', '31+'],
right=False)
# View the distribution after recoding for both counts and proportions
# To view counts
age_2_group_counts = population_data['Age_Group_2'].value_counts().sort_index()
# To view proportions
age_2_group_proportions = population_data['Age_Group_2'].value_counts(normalize=True).sort_index().round(3)
age_2_group_counts, age_2_group_proportions(Age_Group_2
17-18 10123
19-20 8896
21-22 4222
23-25 506
26-30 841
31+ 411
Name: count, dtype: int64,
Age_Group_2
17-18 0.405
19-20 0.356
21-22 0.169
23-25 0.020
26-30 0.034
31+ 0.016
Name: proportion, dtype: float64)
Reminder: The proportion of each subgroup should be calculated after re-coding. If you adjust any levels within a variable, it can affect the distribution of values in the dataset. In practice, you may repeat this step until you are able to draw a proper set of subgroups before creating the strata.
The strata (subgroups) are determined by the following categories:
Gender (Female or Male)
Race (Asian, Black or African American, White, Two or more races, Other)
Age Group (16-18, 19-20, 21-22, 23-30, 31+)
Code example: Proportion by each stratum
# Create a distribution chart for each of the three variables after recoding
import matplotlib.pyplot as plt
# Proportion of Gender in the Sample Selection Data and add the proportion on top of each bar
gender_proportions = sample_selection_data['Gender_m'].value_counts(normalize=True).round(2)
gender_proportions.plot(kind='bar', title='Proportion of Gender')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(gender_proportions):
plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()# Proportion of Race in the Sample Selection Data
race_proportions = sample_selection_data['Race_m'].value_counts(normalize=True).round(2)
race_proportions.plot(kind='bar', title='Proportion of Race')
plt.xticks(rotation=0)
# Round to 2 decimal places and convert to percentage
for i, v in enumerate(race_proportions):
plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()# Proportion of Age Group in the Sample Selection Data
age_proportions = sample_selection_data['Age_Group_2'].value_counts(normalize=True).round(2)
age_proportions.plot(kind='bar', title='Proportion of Age Group')
plt.xticks(rotation=0)
for i, v in enumerate(age_proportions):
plt.text(i, v + 0.01, str(round(v * 100, 2)) + '%', ha='center')
plt.show()# This time we are using a different set of parameters. The formula remains the same from Step 2.
# Parameters
Z = stats.norm.ppf(0.975) # Calculate the critical z-value for
# a 95% confidence level in a standard normal distribution
p = 0.5 # Estimated proportion (use 0.5 for maximum variability, sort of like worst-case scenario)
e_3 = 0.03 # Margin of error (precision level, more precision requires larger sample size)
N = 25000 # Finite population size (in this example, student population size)
# Calculate sample size
# Calculate n0 and adjusted sample size for finite population
n0 = (Z**2 * p * (1 - p)) / (e_3**2)
sample_size_larger_e = n0 / (1 + ((n0 - 1) / N))
print(f"Ideal sample size (n0) for 95% confidence level with 3% margin of error: {round(n0)}")
print(f"Adjusted sample size of finite population for 95% confidence level with 3% margin of error: {int(sample_size_larger_e)}")Ideal sample size (n0) for 95% confidence level with 3% margin of error: 1067
Adjusted sample size of finite population for 95% confidence level with 3% margin of error: 1023
Then, we apply the proportions to get the sample of this updated sample size.
# Determine the combined proportions of the strata in the sample selection data with the modified variables
stratum_set_1 = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
stratum_set_1['Proportion'] = stratum_set_1['Count'] / stratum_set_1['Count'].sum()
# use the 3% sample size to draw the stratified random sample
stratum_sizes_three_moe = (stratum_set_1['Proportion'] * int(sample_size_larger_e)).round().astype(int)
# Add the grouping variables back to the stratum sizes DataFrame for clarity
stratum_sizes_three_moe = pd.concat([stratum_set_1[['Gender_m', 'Race_m', 'Age_Group_2']], stratum_sizes_three_moe.rename('Sample_Size')], axis=1)
print("Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):")
stratum_sizes_three_moe.head(3)Stratum sizes for each subgroup based on 3% margin of error sample size (first 3 rows):
| Gender_m | Race_m | Age_Group_2 | Sample_Size | |
|---|---|---|---|---|
| 0 | Female | Asian | 17-18 | 25 |
| 1 | Female | Asian | 19-20 | 21 |
| 2 | Female | Asian | 21-22 | 10 |
Draw the stratified random sample based on the calculated stratum sizes.
# Initialize an empty list to hold sampled dataframes
sampled_groups = []
for i, row in stratum_sizes_three_moe.iterrows():
stratum_filter = (
(sample_selection_data['Gender_m'] == row['Gender_m']) &
(sample_selection_data['Race_m'] == row['Race_m']) &
(sample_selection_data['Age_Group_2'] == row['Age_Group_2'])
)
stratum_population = sample_selection_data[stratum_filter]
# Use the sample size from the current row
stratum_sample_size = int(row['Sample_Size'])
# Ensure we don't sample from empty strata or more than available
n = min(stratum_sample_size, len(stratum_population))
if n > 0:
stratum_sample = stratum_population.sample(n=n, replace=False)
sampled_groups.append(stratum_sample)
# Combine all sampled dataframes into a single dataframe (guard against empty list)
if sampled_groups:
sampled_dataset = pd.concat(sampled_groups).reset_index(drop=True)
else:
sampled_dataset = pd.DataFrame(columns=sample_selection_data.columns)
# Create an alias expected by downstream cells and align column naming
sampled_data = sampled_dataset.copy()
if 'Age_Group' not in sampled_data.columns and 'Age_Group_2' in sampled_data.columns:
sampled_data['Age_Group'] = sampled_data['Age_Group_2']
# View the first three rows of the sampled dataset
print("Sampled dataset (first 3 rows):")
sampled_data[['StudentID', 'Gender_m', 'Race_m', 'Age_Group']].head(3)Sampled dataset (first 3 rows):
| StudentID | Gender_m | Race_m | Age_Group | |
|---|---|---|---|---|
| 0 | SID003968 | Female | Asian | 17-18 |
| 1 | SID009351 | Female | Asian | 17-18 |
| 2 | SID002933 | Female | Asian | 17-18 |
Stratum are indexed based on the order in the DataFrame.
Sampled Dataset (n = 1024)
print("Sampled dataset size:", len(sampled_dataset))
# Calculate the proportions in the sampled data
sample_proportions = sampled_dataset.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
sample_proportions['Proportion'] = sample_proportions['Count'] / sample_proportions['Count'].sum()
# View the distribution of the sampled data in a chart
import matplotlib.pyplot as plt
# Chart for Sampled Data Proportions
plt.figure(figsize=(5, 3))
plt.bar(range(len(sample_proportions)), sample_proportions['Proportion'])
plt.title('Sampled Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(sample_proportions)), sample_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the sampled proportions DataFrame
sample_proportions[['Proportion']].round(3).head(5)Sampled dataset size: 1024
| Proportion | |
|---|---|
| 0 | 0.024 |
| 1 | 0.021 |
| 2 | 0.010 |
| 3 | 0.001 |
| 4 | 0.002 |
Population Dataset (N = 25,000)
print("population:", len(sample_selection_data))
# Calculate the proportions in the population data
population_proportions = sample_selection_data.groupby(['Gender_m', 'Race_m', 'Age_Group_2'], observed=True).size().reset_index(name='Count')
population_proportions['Proportion'] = population_proportions['Count'] / population_proportions['Count'].sum()
# View the distribution of the population data in a chart
plt.figure(figsize=(5, 3))
plt.bar(range(len(population_proportions)), population_proportions['Proportion'])
plt.title('Population Data Proportions by Stratum')
plt.xlabel('Stratum Index')
plt.ylabel('Proportion')
plt.xticks(range(len(population_proportions)), population_proportions.index, rotation=90)
plt.tight_layout()
plt.show()
# Display the population proportions DataFrame
population_proportions[['Proportion']].head(5).round(3)population: 25000
| Proportion | |
|---|---|
| 0 | 0.024 |
| 1 | 0.021 |
| 2 | 0.010 |
| 3 | 0.001 |
| 4 | 0.002 |
Once verified, the sampled dataset can be used for survey distribution.
Essential when sample proportions differ from population proportions.
Corrects over- or under-representation of certain groups and improves accuracy and validity of survey results.
Post-Stratification weights can be calculated in different ways:
Each stratum individually
Combined categories (joint distribution considered)
Note. weights can be a fraction, but are always positive and non-zero.
Today’s demonstration is similar to individual strata weighting but accounts for joint distribution.
Each weight is determined by dividing the sample proportion by the respondent proportion of each variable. Then, multiply the variable weights together to get the overall weight for each respondent.
In general, the formula for calculating weight for each variable is:
\[ \text{Weight} = \dfrac{\text{Sample Proportion}}{\text{Respondent Proportion}} \]
Because we used a combination of variables in the strata, each participant should be assigned an overall weight:
\[ W_{\text{overall}} = W_{\text{race}} \times W_{\text{gender}} \times W_{\text{age}} \]
The overall weight assigned to each respondent is then used in survey item analysis.
This table shows an example of weight calculation for a respondent (rounded to 3 decimal places for illustration purposes):
| Variables | Participant Value | Sample Proportion | Respondent Proportion | Variable weight = sample_p/respondent_p |
|---|---|---|---|---|
| Gender | Female | .574 | .631 | .910 |
| Age Group | 17-18 | .404 | .261 | 1.548 |
| Race | White | .568 | .556 | 1.022 |
The final weight for this respondent is \[ 0.910*1.548*1.022 = 1.439 \]
Synthetic respondents data based on a survey distributed to the sampled dataset.
There are 241 respondents in the survey results.
import pandas as pd
data_folder = './Data Files/'
# Import the respondents data file (synthetic respondents data based on a survey)
respondents_data = pd.read_excel(data_folder + 'respondents_file.xlsx')
# Make the output view smaller for presentation
print("Respondents dataset (first 3 rows):")
respondents_data[:3]Respondents dataset (first 3 rows):
| Q1_Gender | Q2_Race | Q3_AgeGroup | Q4_Traffic_Accidents | Q5_Academic_Performance | |
|---|---|---|---|---|---|
| 0 | Male | White | 17-18 | 4.0 | 4.0 |
| 1 | Female | White | 17-18 | 3.0 | 3.0 |
| 2 | Female | White | 17-18 | 5.0 | 5.0 |
# Recall the propotions from the sampled data
sampled_data_proportions = {
"gender_sampled": sampled_dataset['Gender_m'].value_counts(normalize=True).round(3).to_dict(),
"race_sampled": sampled_dataset['Race_m'].value_counts(normalize=True).round(3).to_dict(),
"age_group_sampled": sampled_dataset['Age_Group_2'].value_counts(normalize=True).round(3).to_dict()
}
## Calculate the respondents demographic proportions of the three variables: Gender, Race, and Age Group and create a dictionary by the propotion and round to three decimal places
respondent_proportions = {
"Gender_r": respondents_data['Q1_Gender'].value_counts(normalize=True).round(3).to_dict(),
"Race_r": respondents_data['Q2_Race'].value_counts(normalize=True).round(3).to_dict(),
"Age_Group_r": respondents_data['Q3_AgeGroup'].value_counts(normalize=True).round(3).to_dict()
}
# Define the two dictionaries for population and respondents proportions
popl_dict = sampled_data_proportions
resp_dict = respondent_proportions
# Using each of the labels in the variables, then the population or respondent proportions can be retrieved from the dictionaries.
# For example, if Q1_Gender is 'Female', then the population proportion of 0.574 will be retrieved.
# Then, the respondent proportion, 'Gender_r' of 0.631 will be retrieved for the denominator.
# Now we can calculate the weights for each respondent based on the three variables
def calculate_weight(row, popl_dict, resp_dict):
gender_weight = popl_dict['gender_sampled'].get(row['Q1_Gender'], 0) / resp_dict['Gender_r'].get(row['Q1_Gender'], 1)
race_weight = popl_dict['race_sampled'].get(row['Q2_Race'], 0) / resp_dict['Race_r'].get(row['Q2_Race'], 1)
age_group_weight = popl_dict['age_group_sampled'].get(row['Q3_AgeGroup'], 0) / resp_dict['Age_Group_r'].get(row['Q3_AgeGroup'], 1)
total_weight = gender_weight * race_weight * age_group_weight
return total_weightPreview the weights calculated for the first three respondents.
# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['O_Weight'] = respondents_data.apply(calculate_weight, axis=1, args=(popl_dict, resp_dict))
# View the first three rows with the calculated weights
print("Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):")
respondents_data[['Q1_Gender', 'Q2_Race', 'Q3_AgeGroup', 'O_Weight']].head(3).round(3)Respondents data with calculated weights (first 3 rows, rounded to 3 decimal places):
| Q1_Gender | Q2_Race | Q3_AgeGroup | O_Weight | |
|---|---|---|---|---|
| 0 | Male | White | 17-18 | 1.830 |
| 1 | Female | White | 17-18 | 1.442 |
| 2 | Female | White | 17-18 | 1.442 |
Note. Due to decimal rounding, some weights will be slightly different from manual calculations in the previous slide.
Question: On a scale from 0-5, how often does drug abuse lead to traffic accidents among college students?
Scale: 0 - Never to 5 - Very Often
# Apply the weight calculation to each row in the respondents_data DataFrame
respondents_data['Q4_with_weights'] = respondents_data['Q4_Traffic_Accidents'] * respondents_data['O_Weight']
# View the first three rows with the calculated weighted scores
print("Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):")
respondents_data[['Q4_Traffic_Accidents', 'O_Weight', 'Q4_with_weights']].head(3).round(3)Respondents data with weighted Q4_Traffic_Accidents scores (first 3 rows, rounded to 3 decimal places):
| Q4_Traffic_Accidents | O_Weight | Q4_with_weights | |
|---|---|---|---|
| 0 | 4.0 | 1.830 | 7.320 |
| 1 | 3.0 | 1.442 | 4.326 |
| 2 | 5.0 | 1.442 | 7.210 |
Next, we can compare the non-weighted mean to the weighted mean of the survey item.
traffic_accident_response = respondents_data[['Q4_Traffic_Accidents', 'Q4_with_weights']].copy()
non_weighted_mean = traffic_accident_response['Q4_Traffic_Accidents'].mean()
weighted_mean = traffic_accident_response['Q4_with_weights'].mean()
print(f"Non-weighted mean of Q4_Traffic_Accidents: {non_weighted_mean:.3f}")
print(f"Weighted mean of Q4_Traffic_Accidents: {weighted_mean:.3f}")Non-weighted mean of Q4_Traffic_Accidents: 4.606
Weighted mean of Q4_Traffic_Accidents: 4.940
Visual comparison of the non-weighted and weighted scores with histogram:
# Visual comparison of non-weighted vs weighted means with histogram
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.clf() # Clear any previous plots
# Histogram of non-weighted responses
n, bins, patches = plt.hist(
respondents_data['Q4_Traffic_Accidents'],
bins=6,
alpha=0.5,
label='Non-weighted',
color='blue',
edgecolor='black'
)
# Histogram of weighted responses
n_w, bins_w, patches_w = plt.hist(
respondents_data['Q4_with_weights'],
bins=6,
alpha=0.5,
label='Weighted',
color='orange',
edgecolor='black'
)
plt.xlabel('Q4_Traffic_Accidents (0=Never, 5=Very Often)')
plt.ylabel('Frequency')
plt.title('Distribution of Q4_Traffic_Accidents Responses')
plt.legend()
plt.tight_layout()
plt.show()The distribution of responses is more balanced after applying weights, indicating that the weighted mean better reflects the population characteristics.
Abstract: Stratified sampling helps ensure representative data subsets, especially when working with imbalanced groups. In this talk, we’ll explore its use in survey analysis, demonstrate a Python-based implementation, and share best practices for improving data reliability and reducing bias. Attendees will gain a clear understanding of when and how to use stratified sampling effectively in real-world scenarios.
Description: Whether you’re conducting surveys, building predictive models, or working with imbalanced datasets, stratified sampling is a powerful technique for ensuring your data is representative and your insights are reliable.
In survey research, particularly with diverse populations like college students, simple random sampling can lead to under-representation of key subgroups. Stratified sampling addresses this by dividing the population into distinct strata (e.g., age groups, majors, demographics) and sampling proportionally from each. This approach reduces sampling bias and improves the reliability of statistical estimates.
In this presentation, I will demonstrate how to use Python to implement stratified sampling in a student survey at our college. By maintaining proportional representation across age groups, we ensured that our sample reflected the diversity of the student body. I’ll walk through the conceptual foundations of stratified sampling, compare it with simple random sampling, and highlight its advantages in survey research and data analysis.
We’ll also explore how to calculate weighted means when sample proportions differ from population proportions, which is a crucial step in producing unbiased estimates. Code examples and visualizations will be shared via GitHub to support hands-on learning and reproducibility.